import pandas as pd
import numpy as np
import dalex
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
df = pd.read_csv("housing.csv")
df.describe()
|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
df = df.dropna()
df["ocean_proximity"].unique()
array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
dtype=object)
le = LabelEncoder()
df["ocean_proximity"] = le.fit_transform(df["ocean_proximity"])
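One caveat worth noting (an illustrative sketch, not a change to the pipeline above): `LabelEncoder` assigns integer codes in alphabetical order, which imposes an arbitrary ordinal relationship on `ocean_proximity`. Tree-based models such as random forests usually tolerate this, but a one-hot encoding avoids the artificial ordering entirely:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cats = pd.Series(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])

le = LabelEncoder()
codes = le.fit_transform(cats)
# classes_ are sorted alphabetically, so '<1H OCEAN' -> 0, 'INLAND' -> 1, ...
print(dict(zip(le.classes_, range(len(le.classes_)))))

# An order-free alternative: one column per category
onehot = pd.get_dummies(cats)
print(onehot.columns.tolist())
```

With label encoding, "INLAND" (1) sits numerically between "<1H OCEAN" (0) and "ISLAND" (2), a distance relationship the data never claimed.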
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = RandomForestRegressor(n_estimators=5, random_state=0)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
MSE = metrics.mean_squared_error(y_test, y_pred)
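Since MSE is expressed in squared dollars, its square root (RMSE) is easier to read on the `median_house_value` scale. A minimal sketch with made-up numbers (not the housing predictions above):

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted house values, in dollars
y_true = np.array([100000.0, 200000.0, 300000.0])
y_hat = np.array([110000.0, 190000.0, 320000.0])

mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)  # back on the price scale
print(rmse)  # → ~14142.14, i.e. a typical error of about $14k
```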
exp = dalex.Explainer(regr, X_test, y_test)
Preparation of a new explainer is initiated
  -> data              : 4087 rows 9 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 4087 values
  -> model_class       : sklearn.ensemble._forest.RandomForestRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000028752051FC0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 4.41e+04, mean = 2.08e+05, max = 5e+05
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -3.16e+05, mean = -6.51e+02, max = 3.19e+05
  -> model_info        : package sklearn

A new explainer has been created!
d:\coding\daily\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but RandomForestRegressor was fitted with feature names
observation = X_test.iloc[[100, 200]]
observation_pred = regr.predict(observation)
order = X_test.columns.to_list()
exp.predict_parts(observation.iloc[[0]], type="break_down", order=order).plot()
exp.predict_parts(observation.iloc[[0]], type="shap").plot()
The break-down plot shows that longitude has the largest positive contribution to the prediction, while latitude has the largest negative one. We may, however, expect these two variables to interact: together they encode location, not just a single coordinate. The impact of total_rooms flips from positive to negative between the two plots, which suggests it may be interacting with another variable.
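The order-dependence of break-down contributions is exactly what signals an interaction (dalex can surface this directly via `predict_parts(..., type="break_down_interactions")`). The effect can also be reproduced by hand; the sketch below uses synthetic data with a pure product interaction as a hypothetical stand-in for latitude and longitude, not the housing model above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic target with a pure interaction: y = x0 * x1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = X[:, 0] * X[:, 1]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

obs = np.array([0.8, 0.8])
baseline = model.predict(X).mean()  # mean prediction over the data

# x0 added first: fix x0 at its observed value, average over x1
X_fix0 = X.copy()
X_fix0[:, 0] = obs[0]
contrib_x0_first = model.predict(X_fix0).mean() - baseline

# x0 added last: x1 is already fixed, then x0 completes the observation
X_fix1 = X.copy()
X_fix1[:, 1] = obs[1]
contrib_x0_last = model.predict(obs.reshape(1, -1))[0] - model.predict(X_fix1).mean()

# With an interaction present, the two orderings disagree sharply:
# alone, x0 explains almost nothing; after x1 is known, it explains a lot.
print(contrib_x0_first, contrib_x0_last)
```

For purely additive effects the two numbers would coincide, which is why a large gap between break-down plots with different orders (or between break-down and SHAP, which averages over orders) points to an interaction.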
exp.predict_parts(observation.iloc[[1]], type="break_down", order=order).plot()
exp.predict_parts(observation.iloc[[1]], type="shap").plot()
Most variables influence the prediction in the opposite direction compared to the first observation. The latitude and longitude contributions again suggest an interaction.
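Per-observation contributions like these can flip sign from case to case, so a global check of which variables matter overall is a natural complement (dalex offers this via `exp.model_parts()`). A hedged sketch of the same idea with scikit-learn's `permutation_importance`, on synthetic data rather than the housing model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: x0 matters most, x1 a little, x2 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=1000)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Shuffle each column in turn and measure how much the score drops
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```

Unlike the per-observation break-down, a permutation importance cannot change sign between cases; it summarizes each feature's contribution to accuracy across the whole dataset.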